Developing an Integrated and Comprehensive Traditional Chinese Corpus Based on Multi-Character Words for Studying relations between words and lexicons

Authors

Chung-Ching Wang

Sau-chin Chen

Yueh-Lin Tsai

Yong-Ru Hsiao

Jon-Fan Hu

Abstract

Most Chinese corpora were created for single-character words, with indexes such as frequency, stroke number, and phonetic information, for the purposes of basic research. However, multi-character Chinese words are recognized as carrying alterations of meaning and as more useful for investigating reading processes and comprehension. Therefore, to study the complete relations between words and lexicons in Chinese, a corpus requires statistics based on more than single-character words, with valid and reliable indexes. In this study, we present a corpus of Traditional Chinese that provides five word indexes (word sound, word position, word form, semantics, and competence of forming multi-character words) by integrating existing credible corpora. This integration approach is beneficial not only for minimizing inconsistencies of word entities across corpora, but also for calculating quantitative properties of character-to-character relationships. The use of the present corpus will significantly impact studies of Chinese words and reading comprehension.
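As a rough illustration of the kind of statistics the abstract describes, the sketch below merges word lists from two sources and counts, for each character, how many distinct multi-character words contain it in first, middle, or last position. The abstract does not define how its indexes are computed, so this counting scheme, the function names `merge_word_lists` and `character_position_counts`, and the toy word lists are all assumptions made for illustration only.

```python
from collections import defaultdict


def merge_word_lists(*sources):
    """Union the word entries of several sources.

    Each source is a (name, words) pair. The result maps every word to the
    set of source names that list it, which makes inconsistencies between
    sources easy to spot.
    """
    merged = defaultdict(set)
    for name, words in sources:
        for word in words:
            merged[word].add(name)
    return dict(merged)


def character_position_counts(words):
    """Count, per character, the distinct multi-character words in which it
    occurs in first, middle, or last position (an assumed, simplified index)."""
    counts = defaultdict(lambda: {"first": 0, "middle": 0, "last": 0})
    for word in set(words):      # each distinct word is counted once
        if len(word) < 2:        # single-character entries are skipped
            continue
        for i, ch in enumerate(word):
            if i == 0:
                slot = "first"
            elif i == len(word) - 1:
                slot = "last"
            else:
                slot = "middle"
            counts[ch][slot] += 1
    return dict(counts)


# Hypothetical toy data, not taken from the corpus described above.
source_a = ["電腦", "電話", "電視機", "腦"]
source_b = ["電話", "頭腦", "電視機"]

merged = merge_word_lists(("A", source_a), ("B", source_b))
counts = character_position_counts(merged)

print(counts["電"])   # position counts for one character
print(merged["電腦"])  # sources that list this word
```

Words listed by only one source (here `電腦` and `頭腦`) are candidates for the kind of cross-corpus inconsistency that an integration step would need to resolve.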

Similar resources

An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Text

This paper puts forward a new method for automatic Chinese text segmentation based on Chinese character string (CCS) frequency and descending length. It automatically segments meaningful CCS in text by processing longer strings first and using string frequency information, with no thesaurus, no word-to-word probabilities acquired in advance, and no Chinese character index. This method can effective...

A new model for Persian multi-part words edition based on statistical machine translation

In English, multi-part words are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...

Lexicon adaptation with reduced character error (LARCE) - a new direction in Chinese language modeling

Good language modeling relies on good predefined lexicons. For Chinese, since there are no text word boundaries and the concept of “word” is not very well defined, constructing good lexicons is difficult. In this paper, we propose lexicon adaptation with reduced character error (LARCE), which learns new word tokens based on the criterion of reduced adaptation corpus error rate. In this approach...

Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information

In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information", or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort that is required in preparing and maintaining the manually segmented Chinese text for training pur...

Build Chinese Emotion Lexicons Using A Graph-based Algorithm and Multiple Resources

For sentiment analysis, lexicons play an important role in many related tasks. In this paper, aiming to build Chinese emotion lexicons for public use, we adopted a graph-based algorithm which ranks words according to a few seed emotion words. The ranking algorithm exploits the similarity between words, and uses multiple similarity metrics which can be derived from dictionaries, unlabeled corpor...

Publication date: 2015
